<<<<<<< HEAD [Insert a concise and informative title here] ======= [Trends in Gender Participation and Gender Equality in Summer Olympics: 1896-2020] >>>>>>> 103afe631bd05b0cf645a264ad34b0e027198e03 <<<<<<< HEAD ======= >>>>>>> 103afe631bd05b0cf645a264ad34b0e027198e03
<<<<<<< HEAD

Prompt: [3]

ChatGPT/AI disclosure statement: [describe whether and how you used ChatGPT or other AI tools for this assignment. If you did not, please write “I did not use ChatGPT or other AI for this assignment.”]

1. A concise and engaging introduction

At the 2024 Olympics, a team known as “Individual Neutral Athletes” was established, allowing athletes from countries under international sanctions, such as Russia and Belarus, to compete fairly. However, the impact of geopolitical tensions on their training effectiveness and final performance remains uncertain. This event has highlighted the intricate relationship between national politics and international sports competitions. I have been employed by the International Olympic Committee (IOC) to compile new data and examine how the quality of national governance affects Olympic performance. The findings from this research will provide the IOC with insights necessary to make strategic decisions, especially concerning the reallocation of resources to enhance fairness and efficiency in future Olympic Games, thus ensuring the Olympics remain a neutral platform where sportsmanship triumphs over political strif

2. [Insert a section title here]

=======

Prompt: [2]

ChatGPT/AI disclosure statement: [When I use RSelenium, I cannot find the content outside the drop-down list. I asked chatGPT to solve it. He told me that I need to use JavaScript(row 245-258)]

1. [Introduction]

In 2024, the Olympic Games celebrated gender parity in athlete participation for the first time. This prompts vital questions: How has women’s participation evolved, and does their performance match that of men? Furthermore, how do these developments relate to gender equality?

As a PhD student in social sciences, my research focuses on data on the participation and performance of female athletes from different region throughout Olympic history, alongside statistics of global gender equality. This study seeks to understand the relationship between advancements in gender parity in sports and overall gender equality worldwide. The insights derived from this research will provide valuable perspectives for fostering more inclusive sporting environments globally. Before start, it’s a good idea to look at the following code: ### 1.1: Install nesssary packages

# You may need to run these codes because your environment may not have these packages
if (!require(knitr)) install.packages("knitr")
if (!require(rvest)) install.packages("rvest")
if (!require(dplyr)) install.packages("dplyr")
if (!require(tidyr)) install.packages("tidyr")
if (!require(purrr)) install.packages("purrr")
if (!require(tibble)) install.packages("tibble")
if (!require(readr)) install.packages("readr")
if (!require(stringr)) install.packages("stringr")
if (!require(httr)) install.packages("httr")
if (!require(jsonlite)) install.packages("jsonlite")
if (!require(keyring)) install.packages("keyring")
if (!require(RSelenium)) install.packages("RSelenium")
if (!require(ggplot2)) install.packages("ggplot2")
if (!require(grid)) install.packages("grid")
if (!require(fs)) install.packages("fs")
if (!require(png)) install.packages("png")
if (!require(DBI)) install.packages("DBI")
if (!require(RSQLite)) install.packages("RSQLite")
if (!require(plotly)) install.packages("plotly")

1.2: Set up the directory

# Run the following to set up the directory to ensure you have the correct path when starting from any code block
rowdata_ddir <- "rowdata"
if (!dir.exists(rowdata_ddir)) dir.create(rowdata_ddir)
output_ddir <- "output"
if (!dir.exists(output_ddir)) dir.create(output_ddir)
tidydata_ddir <- "output/tidydata"
if (!dir.exists(tidydata_ddir)) dir.create(tidydata_ddir)
readydata_ddir <- "output/readydata"
if (!dir.exists(readydata_ddir)) dir.create(readydata_ddir)
database_ddir <- "output/database"
if (!dir.exists(database_ddir)) dir.create(database_ddir)
vis_ddir <- "output/vis"
if (!dir.exists(vis_ddir)) dir.create(vis_ddir)

1.3: Random ports number

# If you need to re-scraping when using Rselenium, you will encounter port occupancy issues. In this case, run this code and replace the four numbers in the original "port" with the result.
random_port_number <- sample(1000:9999, 1) 
print(random_port_number)
## [1] 1532

2. [Olympic Data]

This study’s primary data includes historical Olympic medal information from Wikipedia and detailed participant statistics for each games from Olympedia. Medal tables were extracted via xpath and CSS. I also used the API from Kaggle(also use Olympedia as source) to get detailed data on athletes prior to 2016 and Rselenium to fill in the missing parts.See License. This is quicker, but my code still allows you to get first-hand data using Rselenium.The data are primary because they not only help to analyze the gender disparities and participation trends in different countries over time, but also to lay the foundation for visualization. Our scraping script is designed with good scraping etiquette. Plus, I’ve stored the API key in advance, so you don’t need to know it to use the code. Due to political disputes, participating regions cannot be mapped one by one to the countries. I will rename them as “region” in the following sections. Do not use for visual mapping.

2.1: Get Olympic medal tables from wiki

>>>>>>> 103afe631bd05b0cf645a264ad34b0e027198e03
library(rvest)
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)
library(readr)
<<<<<<< HEAD

# STEP 1: Scrape all Wikipedia pages of Summer Olympics medal table 
# Manually specify Olympic years, considering historical three interruptions cause by World War
# P.S. Although the Tokyo Olympics was actually be held in 2021, the website shows it as 2020
olympic_years <- c(seq(1896, 1912, by = 4), 1920, seq(1924, 1936, by = 4), 1948, seq(1952, 2016, by = 4), 2020, 2024)

# Base URL for Wikipedia Olympic medal tables
base_url <- "https://en.wikipedia.org/wiki/"

# List to store HTML content of each Olympic medal table page
=======
library(stringr)


rowdata_ddir <- "rowdata"
if (!dir.exists(rowdata_ddir)) dir.create(rowdata_ddir)


# STEP 1: Scrape all Wikipedia pages of Summer Olympics medal table 


# Manually specify Olympic years, considering historical three interruptions cause by World War
# P.S. Although the Tokyo Olympics was actually be held in 2021, the website shows it as 2020
olympic_years <- c(seq(1896, 1912, by = 4), 1920, seq(1924, 1936, by = 4), 1948, seq(1952, 2016, by = 4), 2020, 2024)
base_url <- "https://en.wikipedia.org/wiki/"
>>>>>>> 103afe631bd05b0cf645a264ad34b0e027198e03
olympic_pages <- list()

# Scrape pages for all specified Olympic years
for (year in olympic_years) {
    url_suffix <- sprintf("%d_Summer_Olympics_medal_table#Medal_table", year)
    olympic_url <- paste0(base_url, url_suffix)
<<<<<<< HEAD
    olympic_pages[[as.character(year)]] <- read_html(olympic_url)# Fetch and store HTML content
    Sys.sleep(2)# Pause to respect the server's load
}


# STEP 2: Extract medal tables and host coutry data using XPath 
# Define XPaths based on pages of different years
special_years <- c(1936, 1948, 1960, 1964, 1976, 1980, 1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012, 2016, 2020, 2024)
xpath_years_default <- '//*[@id="mw-content-text"]/div[1]/table[2]/tbody'
xpath_years_special <- '//*[@id="mw-content-text"]/div[1]/table[3]/tbody'

# To store extracted tables
medal_tables <- list() 

# Locate tables with the specified structures
for (year in names(olympic_pages)) {
    page <- olympic_pages[[year]]
    # Determine the correct XPath based on the year
    xpath <- if (as.numeric(year) %in% special_years) xpath_years_special else xpath_years_default
    table_elements <- page %>% html_elements(xpath = xpath)  
    # Output message if fail for further progress
    if (length(table_elements) > 0 && !any(inherits(table_elements, "xml_missing"))) {
        medal_tables[[year]] <- html_table(table_elements[[1]], fill = TRUE)
    } else {
        message(sprintf("No medal table found for year %s, skipping...", year))
    }
    Sys.sleep(2)# Pause to respect the server's load
}

# Prepare a list to store host countries
host_countries_data <- list()

# Scrape host country data using a uniform CSS selector
for (year in names(olympic_pages)) {
    page <- olympic_pages[[year]]
    # Check for special conditions
    if (as.numeric(year) == 1956) {
        host_countries_data[[year]] <- "Australia"  # Manually set for 1956,which has different css
    } else if (as.numeric(year) == 2024) {
        host_countries_data[[year]] <- "France"     # Manually set for 2024, which has different css
    } else {
        # Fetch host country using a uniform CSS selector and store it
        host_country_node <- page %>% html_node(".infobox-data.location a")
        host_country <- if (!is.null(host_country_node)) html_text(host_country_node, trim = TRUE) else "Unknown"
        host_countries_data[[year]] <- host_country
    }
    Sys.sleep(1)  # Pause to respect the server's load
}

 

# Step 3: Process and clean extracted medal_table data
=======
    olympic_pages[[as.character(year)]] <- read_html(olympic_url)
    Sys.sleep(1)
}


# STEP 2: Extract medal tables  using XPath 


special_years <- c(1936, 1948, 1960, 1964, 1976, 1980, 1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012, 2016, 2020, 2024)
xpath_years_default <- '//*[@id="mw-content-text"]/div[1]/table[2]/tbody'
xpath_years_special <- '//*[@id="mw-content-text"]/div[1]/table[3]/tbody'
medal_tables <- list() 
for (year in names(olympic_pages)) {
    page <- olympic_pages[[year]]
    xpath <- if (as.numeric(year) %in% special_years) xpath_years_special else xpath_years_default
    table_elements <- page %>% html_elements(xpath = xpath)  
    if (length(table_elements) > 0 && !any(inherits(table_elements, "xml_missing"))) {
        table <- html_table(table_elements[[1]], fill = TRUE)
        # Specific year check for manual correction
        if (year == "1908") {
            # Correct "Great Britain" name issue by removing unwanted text
            table$Nation <- gsub("Great Britain.*", "Great Britain", table$Nation)
        }
        medal_tables[[year]] <- table
    } else {
        message(sprintf("No medal table found for year %s, skipping...", year))
    }
    Sys.sleep(1)
}


# STEP 3: Extract host countries 


host_url <- "https://en.wikipedia.org/wiki/List_of_Olympic_Games_host_cities"
host_country <- read_html(host_url) %>%
  html_element(xpath = '//*[@id="mw-content-text"]/div[1]/table[1]') %>%
  html_table(fill = TRUE) %>%
  select(Country, Year) %>%
  mutate(
    Year = as.numeric(Year),
    Country = str_remove(Country, "\\[.*?\\]") 
  ) %>%
  filter(Year %in% olympic_years)
host_countries <- host_country %>% deframe() 


# STEP 4: Process and clean extracted medal_table data


>>>>>>> 103afe631bd05b0cf645a264ad34b0e027198e03
# Initialize an empty tibble for storing results
olympic_medal_data <- tibble(
  olympic_year = integer(),
  olympic_rank = integer(),
  country_participate = character(),
  medal_gold = integer(),
  medal_silver = integer(),
  medal_bronze = integer(),
  medal_total = integer(),
  country_host = integer())

# Process each page's medal table and integrate host country data
for (year in names(medal_tables)) {
    table <- medal_tables[[year]]
<<<<<<< HEAD
   # Remove the last row which usually contains totals
    if (nrow(table) > 0) {
        table <- table[-nrow(table), ]
    }
    
    # Check and find the table has 'NOC' or 'Nation' as the column name and standardize it
    if ("NOC" %in% names(table)) {
      country_col_name <- "NOC"
    } else if ("Nation" %in% names(table)) {
      country_col_name <- "Nation"
    } 
    
   # Transform the HTML table into a cleaned and structured tibble
    medal_data <- table %>%
        rename(country_participate = {{country_col_name}}) %>%
        mutate(
          olympic_year = as.integer(year),
          olympic_rank = as.integer(Rank),
          country_participate = gsub("\\[.*?\\]|\\*|‡|\\(.*?\\)", "", country_participate),  # Remove annotations, superscripts, and content in parentheses
          medal_gold = as.integer(Gold),
          medal_silver = as.integer(Silver),
          medal_bronze = as.integer(Bronze),
          medal_total = as.integer(Total),
          country_host = ifelse(country_participate == host_countries_data[[year]], 1, 0)
        ) %>%
        select(olympic_year, olympic_rank, country_participate, medal_gold, medal_silver, medal_bronze, medal_total, country_host)

    # Append this year's data to the main Olympic medal dataset
    olympic_medal_data <- bind_rows(olympic_medal_data, medal_data)
}

# Print the final structured data for verification
head(olympic_medal_data)
## # A tibble: 6 × 8
##   olympic_year olympic_rank country_participate medal_gold medal_silver
##          <int>        <int> <chr>                    <int>        <int>
=======
    # Remove the last row which contains totals
    table <- table[-nrow(table), ]
    table <- table %>%
        rename_with(~ if_else(.x %in% c("NOC", "Nation"), "country_participate", .x)) %>%
        filter(country_participate != "Mixed team") # Remove non-nation variables
    
# Clean and convert data types
medal_data <- table %>%
        mutate(
            medal_gold = as.numeric(Gold),
            medal_silver = as.numeric(Silver),
            medal_bronze = as.numeric(Bronze),
            medal_total = as.numeric(Total), 
            olympic_year = as.integer(year),
            olympic_rank = as.integer(Rank),
            country_participate = gsub("\\[.*?\\]|\\*|‡|\\u00a0", "", country_participate) %>% trimws(),
            country_participate = if_else(
                olympic_year %in% c(1932, 1960), 
                gsub("[^A-Za-z ]", "", country_participate) %>%
                gsub("[A-Z]{3}$", "", .) %>% 
                trimws(),
                country_participate ),
            country_host = ifelse(
                country_participate == host_country %>% filter(Year == as.numeric(year)) %>% pull(Country), 1, 0)) %>%
        select( olympic_year, olympic_rank, country_participate,
            medal_gold, medal_silver, medal_bronze, medal_total, country_host)
    olympic_medal_data <- bind_rows(olympic_medal_data, medal_data)
}
write_csv(olympic_medal_data, file.path(rowdata_ddir, "olympic_medal_data.csv"))
print(head(olympic_medal_data))
## # A tibble: 6 × 8
##   olympic_year olympic_rank country_participate medal_gold medal_silver
##          <int>        <int> <chr>                    <dbl>        <dbl>
>>>>>>> 103afe631bd05b0cf645a264ad34b0e027198e03
## 1         1896            1 United States               11            7
## 2         1896            2 Greece                      10           18
## 3         1896            3 Germany                      6            5
## 4         1896            4 France                       5            4
## 5         1896            5 Great Britain                2            3
## 6         1896            6 Hungary                      2            1
<<<<<<< HEAD
## # ℹ 3 more variables: medal_bronze <int>, medal_total <int>, country_host <dbl>
write_csv(olympic_medal_data, "olympic_medal_data.csv")  # Save results as a CSV file

The primary data for this study comprises historical Olympic medal information sourced from publicly accessible Wikipedia pages for each Summer Olympic event from 1896(earliest) to 2024(latest). The choice of Wikipedia is due to its comprehensive and publicly reviewed datasets. This data was programmatically retrieved using R with the rvest package.Through Xpath and CSS, we extract and parse HTML pages to retrieve medal tables and country details.This data is primary because it not only serves as a foundation for assessing national performances across Olympics but also supports integrated analysis with political datasets and underpins the creation of visual maps. One limitation is the inherent untidiness of the scraped data, requiring futher preprocessing in sections 5 and 6 below. Moreover, the XPath used for scraping may vary, especially with future Olympic pages, necessitating periodic script adjustments to maintain data accuracy. To address potential ethical concerns and align with good scraping etiquette, our data collection process includes measures such as adhering to Wikipedia’s robots.txt, implementing Sys.sleep() to moderate our request rate, and use list to store the result of running read_html() only once, reducing the number of requests. Please follow these parts.

3. [Insert a section title here]

[The text and code for this section goes here.]

4. [Insert a section title here]

[The text and code for this section goes here.]

5. [Insert a section title here]

[The text and code for this section goes here.]

6. [Insert a section title here]

[The text and code for this section goes here.]

======= ## # ℹ 3 more variables: medal_bronze <dbl>, medal_total <dbl>, country_host <dbl>

This is the historical medal table obtained from wiki.

2.2 : Get Olympic participant data and results

2.2.1: Get athletes and medal results from Athens 1896 to Rio 2016

library(httr)
library(jsonlite)
library(tidyverse)
library(keyring)
library(tibble)
library(RSelenium)
# NOTE: Basic bio data on athletes and medal results from Athens 1896 to Rio 2016 have been published on the [Kaggle](https://www.kaggle.com/datasets/heesoo37/120-years-of-olympic-history-athletes-and-results) , and it has been generously allowed to be used for free via API. Direct use not only reduces the workload, but also reduces the pressure on crawling the server, so I will only crawl data for 2020.But my steps can be used for all years

# Retrieve the API key from keyring
# For the purpose of password protection, I have already set the API key, so you don't have to do step in the next row
# keyring::key_set(service = "Kaggle", username = "olympic-api-key")

api_key <- keyring::key_get(service = "Kaggle", username = "olympic-api-key")

# Download zip file
url <- "https://www.kaggle.com/api/v1/datasets/download/heesoo37/120-years-of-olympic-history-athletes-and-results"
output_file <- "rowdata/120-years-of-olympic-history-athletes-and-results.zip"
response <- GET(url, add_headers(Authorization = paste("Bearer", api_key)), write_disk(output_file, overwrite = TRUE))

# Unzip the file
zip_path <- "rowdata/120-years-of-olympic-history-athletes-and-results.zip"
extraction_path <- "rowdata/120-years-of-olympic-history-athletes-and-results"
unzip(zipfile = zip_path, exdir = extraction_path)
extracted_files <- list.files(extraction_path, full.names = TRUE)

# Load the file
olympic_athletes_rowdata <- read_csv(extracted_files[1])
NOC_rowdata <- read_csv(extracted_files[2])

# Convert the  DataFrame to a tibble
olympic_athletes_rowdata <- as_tibble(olympic_athletes_rowdata)
NOC_rowdata <- as_tibble(NOC_rowdata)

# Merge two file and filter by Olympic years (summer)
mapping <- NOC_rowdata %>%
  select(NOC, region) %>%
  distinct()
olympic_athletes_data <- olympic_athletes_rowdata %>%
  left_join(mapping, by = "NOC") %>%
  mutate(Team = ifelse(is.na(region), Team, region)) %>%
  filter(Year %in% olympic_years) %>%
  select(Name, Sex, Team, Year, Sport, Event, Medal)

# Filter by female
olympic_female_athletes_data <- olympic_athletes_data %>%
  filter(Sex == "F")%>%
  select(-Sex)
print(head(olympic_female_athletes_data))
## # A tibble: 6 × 6
##   Name                                   Team         Year Sport     Event Medal
##   <chr>                                  <chr>       <dbl> <chr>     <chr> <chr>
## 1 "Christine Jacoba Aaftink"             Netherlands  1988 Speed Sk… Spee… <NA> 
## 2 "Christine Jacoba Aaftink"             Netherlands  1988 Speed Sk… Spee… <NA> 
## 3 "Christine Jacoba Aaftink"             Netherlands  1992 Speed Sk… Spee… <NA> 
## 4 "Christine Jacoba Aaftink"             Netherlands  1992 Speed Sk… Spee… <NA> 
## 5 "Cornelia \"Cor\" Aalten (-Strannood)" Netherlands  1932 Athletics Athl… <NA> 
## 6 "Cornelia \"Cor\" Aalten (-Strannood)" Netherlands  1932 Athletics Athl… <NA>

This table shows information about female Olympic competitors throughout history

2.2.2: Get medal results from Olympedia at 2020

# Start RSelenium Driver for Firefox
driver <- rsDriver(browser = "firefox", port = 4445L, verbose = FALSE)
remote_driver <- driver$client

# Navigate to the Olympedia website
url <- "https://www.olympedia.org/statistics/medal/athlete"
remote_driver$navigate(url)
Sys.sleep(3)

# Define all valid year positions in the dropdown
year_positions <- c(1:3, 5:26, 53, 54, 59, 61) 

# Initialize a variable to store only 2020 data
medal_data_2020 <- NULL

# Loop through all valid year positions
for (pos in year_positions) {
  # Use JavaScript to directly select the year based on position
  tryCatch({
    ### * the following lines of code were generated by AI/ChatGPT 
    remote_driver$executeScript(sprintf("
      var dropdown = document.getElementById('edition_select');
      dropdown.value = %d;  // Select the option with the given value
      var event = new Event('change');  // Trigger a change event
      dropdown.dispatchEvent(event);
    ", pos))
    Sys.sleep(2) 
  }, error = function(e) {
    next
  })
  
  # Select "Female" in the gender dropdown
  tryCatch({
    gender_dropdown <- remote_driver$findElement(using = 'xpath', value = '//*[@id="gender"]')
    gender_dropdown$clickElement()
    Sys.sleep(0.5)
    female_option <- remote_driver$findElement(using = 'xpath', value = '//*[@id="gender"]/option[3]')
    female_option$clickElement()
    Sys.sleep(2)  
  }, error = function(e) {
    next
  })
  
  # Extract the page source and parse the medal table
  tryCatch({
    page_source <- remote_driver$getPageSource()[[1]]
    page_html <- read_html(page_source)
    medal_table <- page_html %>%
      html_table(fill = TRUE) %>%
      .[[1]]
    year_text <- sprintf("Position %d", pos)
    medal_table <- medal_table %>% mutate(Year_Position = year_text)
    
    # If it's 2020 (position 61), store the data
    if (pos == 61) {
      medal_data_2020 <- medal_table
    }
  }, error = function(e) {
    next
  })
}

# Save only the 2020 medal data to a CSV file
if (!is.null(medal_data_2020)) {
  write_csv(medal_data_2020,  file.path(rowdata_ddir,"Olympic_Female_Medals_2020.csv"))
}
print(head(medal_data_2020))
## # A tibble: 6 × 7
##   Athlete               Nat    Gold Silver Bronze Total Year_Position
##   <chr>                 <chr> <int>  <int>  <int> <int> <chr>        
## 1 Emma McKeon           AUS       4      0      2     6 Position 61  
## 2 Lisa Carrington       NZL       3      0      0     3 Position 61  
## 3 Elaine Thompson-Herah JAM       3      0      0     3 Position 61  
## 4 Kaylee McKeown        AUS       3      0      0     3 Position 61  
## 5 Katie Ledecky         USA       2      2      0     4 Position 61  
## 6 Ariarne Titmus        AUS       2      1      1     4 Position 61
# Close the RSelenium session
remote_driver$close()
driver$server$stop()
## [1] TRUE

This table shows the women who won the 2020 award

2.2.3: Get athletes at 2020

# Because I've already show you about how to get all the years with RSelenium, I won't repeat it again here, just grab 2020 directly.
url2020 <- "https://www.olympedia.org/counts/edition/61"
html_content <- read_html(url2020)
tab <- html_table(html_content, fill = TRUE)
olympic_participators2020 <- tab[[1]]

write_csv(olympic_participators2020,  file.path(rowdata_ddir,"olympic_participators2020.csv"))
print(head(olympic_participators2020))
## # A tibble: 6 × 102
##   X1    X2    X3    X4    X5    X6    X7    X8    X9    X10   X11   X12   X13  
##   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ""    ARC   ARC   ATH   ATH   BBL   BBL   BDM   BDM   BK3   BK3   BKB   BKB  
## 2 ""    m     w     m     w     m     w     m     w     m     w     m     w    
## 3 "AFG" -     -     1     1     -     -     -     -     -     -     -     -    
## 4 "ALB" -     -     1     1     -     -     -     -     -     -     -     -    
## 5 "ALG" -     -     4     1     -     -     -     -     -     -     -     -    
## 6 "AND" -     -     1     -     -     -     -     -     -     -     -     -    
## # ℹ 89 more variables: X14 <chr>, X15 <chr>, X16 <chr>, X17 <chr>, X18 <chr>,
## #   X19 <chr>, X20 <chr>, X21 <chr>, X22 <chr>, X23 <chr>, X24 <chr>,
## #   X25 <chr>, X26 <chr>, X27 <chr>, X28 <chr>, X29 <chr>, X30 <chr>,
## #   X31 <chr>, X32 <chr>, X33 <chr>, X34 <chr>, X35 <chr>, X36 <chr>,
## #   X37 <chr>, X38 <chr>, X39 <chr>, X40 <chr>, X41 <chr>, X42 <chr>,
## #   X43 <chr>, X44 <chr>, X45 <chr>, X46 <chr>, X47 <chr>, X48 <chr>,
## #   X49 <chr>, X50 <chr>, X51 <chr>, X52 <chr>, X53 <chr>, X54 <chr>, …

This table shows the 2020 athletes of all genders

3. [GII data]

Using RSelenium, I automatically download the JSON of Gender Inequality Index (GII) from UNDP.I also downloaded and plotted a PNG of the GII definition to illustrate what it complements. This secondary data enhances our understanding of the socio-economic factors influencing Olympic sports.I envision that performing multiple linear regression analysis using various dimensions from the GII along with the primary data will yield interesting results.

3.1: Download GII JSON for all countries

library(readxl)
library(png)
library(ggplot2)
library(grid)
library(fs)

custom_download_dir <- "rowdata"

# Set Firefox preferences
firefox_profile <- list(
  "browser.download.folderList" = 2,
  "browser.download.dir" = custom_download_dir,
  "browser.helperApps.neverAsk.saveToDisk" = "application/json" 
)

# Start Selenium driver with Firefox
driver <- rsDriver(
  browser = "firefox",
  port = 5482L,
  verbose = FALSE,
  extraCapabilities = list(
    "moz:firefoxOptions" = list(
      prefs = firefox_profile
    )
  )
)
remote_driver <- driver$client

# Navigate to the target URL
gii_url <- "https://hdr.undp.org/data-center/documentation-and-downloads"
remote_driver$navigate(gii_url)
Sys.sleep(3)

# Simulate four click and select process
filter_by_index_xpath <- '//*[@id="rc_select_0"]'
filter_by_index_input <- remote_driver$findElement(using = "xpath", value = filter_by_index_xpath)
filter_by_index_input$clickElement()  # Click to activate the dropdown
Sys.sleep(1)  
filter_by_index_input$sendKeysToElement(list("GII"))  # Type "GII"
Sys.sleep(1)
filter_by_index_input$sendKeysToElement(list(key = "enter"))  # Press Enter to confirm selection
Sys.sleep(1)

filter_by_indicator_xpath <- '//*[@id="rc_select_1"]'
filter_by_indicator_input <- remote_driver$findElement(using = "xpath", value = filter_by_indicator_xpath)
filter_by_indicator_input$clickElement()  # Click to activate the dropdown
Sys.sleep(1)  
filter_by_indicator_input$sendKeysToElement(list("Gender"))  # Type "Gender" to choose Gender Inequality Index 
Sys.sleep(1)
filter_by_indicator_input$sendKeysToElement(list(key = "enter"))  # Press Enter to confirm selection
Sys.sleep(1)

filter_by_year_xpath <- '//*[@id="rc_select_2"]'
filter_by_year_input <- remote_driver$findElement(using = "xpath", value = filter_by_year_xpath)
filter_by_year_input$clickElement()  # Click to activate the dropdown
Sys.sleep(1) 
filter_by_year_input$sendKeysToElement(list("Select All"))  # Type "Select All" to choose all years
Sys.sleep(1)
filter_by_year_input$sendKeysToElement(list(key = "enter"))  # Press Enter to confirm selection
Sys.sleep(1)

filter_by_region_xpath <- '//*[@id="rc_select_3"]'
filter_by_region_input <- remote_driver$findElement(using = "xpath", value = filter_by_region_xpath)
filter_by_region_input$clickElement()  # Click to activate the dropdown
Sys.sleep(1) 
filter_by_region_input$sendKeysToElement(list("Select All Countries"))  # Type "Select All Countries" to choose all 
Sys.sleep(1)
filter_by_region_input$sendKeysToElement(list(key = "enter"))  # Press Enter to confirm selection
Sys.sleep(1)

# Simulate clicking the "search" button
search_button_xpath <- '//*[@id="root"]/div/div/div[1]/div[5]'
search_button <- remote_driver$findElement(using = "xpath", value = search_button_xpath)
search_button$clickElement() 
Sys.sleep(3)

# Simulate clicking the "download" button
json_download_xpath <- '//*[@id="root"]/div/div/div[3]/div[1]/div/div[2]/button'
json_download_button <- remote_driver$findElement(using = "xpath", value = json_download_xpath)
json_download_button$clickElement()
Sys.sleep(5)

# Close the new remote driver
remote_driver$close()
driver$server$stop()
## [1] TRUE

3.2: Download GII JSON for world

# Start a new Selenium driver with a new port
new_driver <- rsDriver(
  browser = "firefox",
  port = 1482L,  # Use a different port
  verbose = FALSE,
  extraCapabilities = list(
    "moz:firefoxOptions" = list(
      prefs = firefox_profile
    )
  )
)
new_remote_driver <- new_driver$client

# Navigate to the target URL
new_remote_driver$navigate(gii_url)
Sys.sleep(3)

# Simulate four click and select process
filter_by_index_xpath <- '//*[@id="rc_select_0"]'
filter_by_index_input <- new_remote_driver$findElement(using = "xpath", value = filter_by_index_xpath)
filter_by_index_input$clickElement()  # Click to activate the dropdown
Sys.sleep(1)
filter_by_index_input$sendKeysToElement(list("GII"))  # Type "GII"
Sys.sleep(1)
filter_by_index_input$sendKeysToElement(list(key = "enter"))  # Press Enter to confirm selection
Sys.sleep(1)

filter_by_indicator_xpath <- '//*[@id="rc_select_1"]'
filter_by_indicator_input <- new_remote_driver$findElement(using = "xpath", value = filter_by_indicator_xpath)
filter_by_indicator_input$clickElement()  # Click to activate the dropdown
Sys.sleep(1)
filter_by_indicator_input$sendKeysToElement(list("Gender")) # Type "Gender" to choose Gender Inequality Index 
Sys.sleep(1)
filter_by_indicator_input$sendKeysToElement(list(key = "enter"))  # Press Enter to confirm selection
Sys.sleep(1)

filter_by_year_xpath <- '//*[@id="rc_select_2"]'
filter_by_year_input <- new_remote_driver$findElement(using = "xpath", value = filter_by_year_xpath)
filter_by_year_input$clickElement()  # Click to activate the dropdown
Sys.sleep(1)
filter_by_year_input$sendKeysToElement(list("Select All"))  # Type "Select All" to choose all years
Sys.sleep(1)
filter_by_year_input$sendKeysToElement(list(key = "enter"))  # Press Enter to confirm selection
Sys.sleep(1)

filter_by_region_xpath <- '//*[@id="rc_select_3"]'
filter_by_region_input <- new_remote_driver$findElement(using = "xpath", value = filter_by_region_xpath)
filter_by_region_input$clickElement()  # Click to activate the dropdown
Sys.sleep(1)
filter_by_region_input$sendKeysToElement(list("World"))  
Sys.sleep(1)
for (i in 1:5) {  # Adjust the number of iterations based on the position of the desired "World" option
  filter_by_region_input$sendKeysToElement(list(key = "down_arrow"))
  Sys.sleep(0.5)
}
filter_by_region_input$sendKeysToElement(list(key = "enter"))  # Confirm selection with Enter
Sys.sleep(1)


# Simulate clicking the "search" button
search_button_xpath <- '//*[@id="root"]/div/div/div[1]/div[5]'
search_button <- new_remote_driver$findElement(using = "xpath", value = search_button_xpath)
search_button$clickElement()
Sys.sleep(3)

# Simulate clicking the "download" button
json_download_xpath <- '//*[@id="root"]/div/div/div[3]/div[1]/div/div[2]/button'
json_download_button <- new_remote_driver$findElement(using = "xpath", value = json_download_xpath)
json_download_button$clickElement() 
Sys.sleep(10)

# Close the new remote driver
new_remote_driver$close()
new_driver$server$stop()
## [1] TRUE

3.3: Download the Explanatory GII Image and plot it

# Download the Explanatory GII Image and plot it
img_url <- "https://hdr.undp.org/sites/default/files/styles/1400x/public/images/2022-05/GII_diagram.png?itok=fyvRbzzb"
downloaded_img_path <- file.path(custom_download_dir, "GII_diagram.png")
GET(img_url, write_disk(downloaded_img_path, overwrite = TRUE))
## Response [https://hdr.undp.org/sites/default/files/styles/1400x/public/images/2022-05/GII_diagram.png?itok=fyvRbzzb]
##   Date: 2025-01-15 16:59
##   Status: 200
##   Content-Type: image/png
##   Size: 61.5 kB
## <ON DISK>  /Users/dongyao/Desktop/my472-at24-final-dongyaook/rowdata/GII_diagram.png
# Read the PNG
img_to_plot <- png::readPNG(downloaded_img_path)

# Display the Image Using ggplot2
ggplot() +
  annotation_custom(rasterGrob(img_to_plot), xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf) +
  theme_void() +
  ggtitle("Explanatory GII Diagram")

> This PNG explains the GII Dimensions and Indicators

4. [Prepare before analysis]

4.1: Make data tidy

I structured the data to adhere to the principles of tidy data. I created unique IDs, standardized column names, integrated datasets, and transformed data into a long format.

output_ddir <- "output"
if (!dir.exists(output_ddir)) dir.create(output_ddir)

tidydata_ddir <- "output/tidydata"
if (!dir.exists(tidydata_ddir)) dir.create(tidydata_ddir)

#### ORIGINAL DATA 1:olympic_medal_data 


olympic_medal_data <- read_csv("rowdata/olympic_medal_data.csv")

# It seems olympic_medal_data is already tidy.But we can add a column of IDs consisting of years and NOCs to make the table look neater
# Fix the formatting errors in 1932 and 1960
data_to_modify <- olympic_medal_data %>%
  filter(olympic_year %in% c(1932, 1960)) %>%
  mutate(country_participate = trimws(country_participate, which = "right"))
olympic_medal_data <- olympic_medal_data %>%
  filter(!(olympic_year %in% c(1932, 1960))) %>%
  bind_rows(data_to_modify)

#  Make orrections
noc_data <- read_csv("rowdata/120-years-of-olympic-history-athletes-and-results/noc_regions.csv") %>%
  # Filter out non-country NOC codes
filter(!NOC %in% c("ANZ", "NFL", "BOH", "TCH", "SAA", "GDR", "FRG", "CRT", "MAL","NBO", "ROT", "TUV", "UNK", "IOA", "NOC", "EUN", "URS", "YUG","SCG", "WIF", "VNM", "YAR", "YMD", "RHO", "HKG")) %>% 
# Update alterntive notes
 mutate(notes = case_when(
    NOC == "GBR" & is.na(notes) ~ "Great Britain",
    NOC == "USA" & is.na(notes) ~ "United States",
    NOC == "RUS" & is.na(notes) ~ "Russian Empire, Soviet Union",
    NOC == "SRI" & is.na(notes) ~ "Ceylon",
    NOC == "GER" & is.na(notes) ~ "United Team of Germany, East Germany, West Germany",
    NOC == "AUS" & is.na(notes) ~ "Australasia",
    NOC == "CZE" & is.na(notes) ~ "Czechoslovakia, Bohemia",
    NOC == "SRB" & is.na(notes) ~ "Yugoslavia",
    NOC == "TEP" & is.na(notes) ~ "Chinese Taipei",
    TRUE ~ notes
  )) %>%
  separate_rows(notes, sep = ", ") %>%
  mutate(notes = trimws(notes))

# Prepare versions of noc_data for flexible joining
combined_noc_data <- noc_data %>%
  select(NOC, country_participate = region) %>%
  bind_rows(noc_data %>% select(NOC, country_participate = notes))

# Join on combined data and generate ID column
olympic_medal_data <- olympic_medal_data %>%
  left_join(combined_noc_data, by = "country_participate") %>%
  mutate(
    id = ifelse(is.na(NOC), paste(olympic_year, "Unknown", sep = "_"), paste(olympic_year, NOC, sep = "_"))
  ) %>%
  select(id, everything())

# Rename 'country_participate' to 'region_participate' as there may be unsolved geographical disputes
tidy_olympic_medal_data <- olympic_medal_data %>%
  rename(region_participate = country_participate)

write_csv(tidy_olympic_medal_data, file.path(tidydata_ddir, "tidy_olympic_medal_data.csv"))



# ORIGINAL DATA 2:Olympic_Female_Medals_2020 


olympic_female_medals2020 <- read_csv("rowdata/Olympic_Female_Medals_2020.csv")

# It seems Olympic_Female_Medals_2020  is already tidy.But can do some corrections about formatting
tidy_olympic_female_medals2020 <- olympic_female_medals2020 %>%
  rename(Year = Year_Position) %>% 
  mutate(Year = ifelse(Year == "Position 61", 2020, as.numeric(Year)))

write_csv(tidy_olympic_female_medals2020, file.path(tidydata_ddir, "tidy_olympic_female_medals2020.csv"))


# ORIGINAL DATA 3:athlete_events


# Load the data
athlete_events <- read_csv("rowdata/120-years-of-olympic-history-athletes-and-results/athlete_events.csv")

# It seems Olympic_Female_Medals_2020 is untidy.
# Prepare the data by replacing NA with "None" and fiilter Year
athlete_events <- mutate(athlete_events, Year = as.integer(Year))
summer_athletes <- athlete_events %>%
  filter(Year %in% olympic_years) %>%
  mutate(Medal = ifelse(is.na(Medal), "None", Medal)) %>%
  select(ID, Name, Sex, NOC, Year, Medal)

# Make the table wider by the function summarise
tidy_athlete_events <- summer_athletes %>%
  mutate(
    Gold = as.integer(Medal == "Gold"),
    Silver = as.integer(Medal == "Silver"),
    Bronze = as.integer(Medal == "Bronze")
  ) %>%
  group_by(ID, Name, Sex, NOC, Year) %>%
  summarise(
    Gold = sum(Gold, na.rm = TRUE),
    Silver = sum(Silver, na.rm = TRUE),
    Bronze = sum(Bronze, na.rm = TRUE),
    .groups = "drop"
  )

write_csv(tidy_athlete_events, file.path(tidydata_ddir, "tidy_athlete_events.csv"))


# ORIGINAL DATA 4:olympic_participators2020


olympic_participators2020 <- read_csv("rowdata/olympic_participators2020.csv")

# It seems Olympic_Female_Medals_2020 is untidy.
# Extract headers and clean the data
headers <- paste0(olympic_participators2020[1, ], "_", olympic_participators2020[2, ])
headers <- gsub("_NA", "", headers) 
headers[1] <- "Country"
olympic_participators2020 <- olympic_participators2020[-c(1, 2), ]
colnames(olympic_participators2020) <- headers

# Pivot it to longer
tidy_olympic_participators2020 <- olympic_participators2020 %>%
  pivot_longer(
    cols = -Country,
    names_to = c("Event", "Gender"),
    names_sep = "_",
    values_to = "Participants"
  ) %>%
  mutate(
    Participants = as.numeric(Participants),
    Gender = recode(Gender, m = "Male", w = "Female")
  ) %>%
  drop_na(Participants)

# Generate an ID column
tidy_olympic_participators2020 <- tidy_olympic_participators2020 %>%
  mutate(id = paste0("2020_", Country)) %>%
  select(id, everything())

write_csv(tidy_olympic_participators2020, file.path(tidydata_ddir, "tidy_olympic_participators2020.csv"))


# ORIGINAL DATA 5:GII from JSON


GII_json <- fromJSON("rowdata/hdr-data.json") 
gii_data <- as_tibble(GII_json)

GII_world_json <- fromJSON("rowdata/hdr-data(1).json") 
gii_world_data <- as_tibble(GII_world_json)

# It seems they are already tidy.But for future convenience, we will also add ID here following the previous steps
tidy_gii_data <- gii_data %>%
  left_join(combined_noc_data, by = c("country" = "country_participate")) %>%  
  mutate(
    id = ifelse(is.na(NOC), paste(year, "Unknown", sep = "_"), paste(year, NOC, sep = "_"))
  ) %>%
  select(id, country, year, value) 


write_csv(tidy_gii_data, file.path(tidydata_ddir, "tidy_gii_data.csv"))

tidy_gii_world_data <-  gii_world_data %>%
  select( year, value) 
write_csv(tidy_gii_world_data, file.path(tidydata_ddir, "tidy_gii_world_data.csv"))

# Now we can see tidy versions of all the data
print(head(tidy_olympic_medal_data))
## # A tibble: 6 × 10
##   id       olympic_year olympic_rank region_participate medal_gold medal_silver
##   <chr>           <dbl>        <dbl> <chr>                   <dbl>        <dbl>
## 1 1896_USA         1896            1 United States              11            7
## 2 1896_GRE         1896            2 Greece                     10           18
## 3 1896_GER         1896            3 Germany                     6            5
## 4 1896_GER         1896            3 Germany                     6            5
## 5 1896_GER         1896            3 Germany                     6            5
## 6 1896_FRA         1896            4 France                      5            4
## # ℹ 4 more variables: medal_bronze <dbl>, medal_total <dbl>,
## #   country_host <dbl>, NOC <chr>
print(head(tidy_olympic_female_medals2020))
## # A tibble: 6 × 7
##   Athlete               Nat    Gold Silver Bronze Total  Year
##   <chr>                 <chr> <dbl>  <dbl>  <dbl> <dbl> <dbl>
## 1 Emma McKeon           AUS       4      0      2     6  2020
## 2 Lisa Carrington       NZL       3      0      0     3  2020
## 3 Elaine Thompson-Herah JAM       3      0      0     3  2020
## 4 Kaylee McKeown        AUS       3      0      0     3  2020
## 5 Katie Ledecky         USA       2      2      0     4  2020
## 6 Ariarne Titmus        AUS       2      1      1     4  2020
print(head(tidy_athlete_events))
## # A tibble: 6 × 8
##      ID Name                     Sex   NOC    Year  Gold Silver Bronze
##   <dbl> <chr>                    <chr> <chr> <int> <int>  <int>  <int>
## 1     1 A Dijiang                M     CHN    1992     0      0      0
## 2     2 A Lamusi                 M     CHN    2012     0      0      0
## 3     3 Gunnar Nielsen Aaby      M     DEN    1920     0      0      0
## 4     4 Edgar Lindenau Aabye     M     DEN    1900     1      0      0
## 5     5 Christine Jacoba Aaftink F     NED    1988     0      0      0
## 6     5 Christine Jacoba Aaftink F     NED    1992     0      0      0
print(head(tidy_olympic_participators2020))
## # A tibble: 6 × 5
##   id       Country Event Gender Participants
##   <chr>    <chr>   <chr> <chr>         <dbl>
## 1 2020_AFG AFG     ATH   Male              1
## 2 2020_AFG AFG     ATH   Female            1
## 3 2020_AFG AFG     SHO   Male              1
## 4 2020_AFG AFG     SWM   Male              1
## 5 2020_AFG AFG     TKW   Male              1
## 6 2020_AFG AFG     Total Male              4
print(head(tidy_gii_data))
## # A tibble: 6 × 4
##   id       country     year  value
##   <chr>    <chr>       <chr> <dbl>
## 1 2008_AFG Afghanistan 2008  0.69 
## 2 2009_AFG Afghanistan 2009  0.696
## 3 2010_AFG Afghanistan 2010  0.707
## 4 2011_AFG Afghanistan 2011  0.718
## 5 2012_AFG Afghanistan 2012  0.734
## 6 2013_AFG Afghanistan 2013  0.712
print(head(tidy_gii_world_data))
## # A tibble: 1 × 2
##   year  value
##   <chr> <dbl>
## 1 2022  0.485

The above tables show row data have become tidy tabular data

4.2: Prepare data

I make the data ready to use by following steps: 1.Summarized total participants by gender, region and year 2.Summarized medal result by gender, region and year 3.Add a uniquely identifiable ID column in each table using year and region 4.Calculate host count and rank 5.Calculate historical Olympic medal tally by country

readydata_ddir <- "output/readydata"
if (!dir.exists(readydata_ddir)) dir.create(readydata_ddir)

# FIRST TRANSFORMATIONS: Summarize the data (1896-2016) by country and year, while separately summarizing medals won by men and women.And add ID for future convenience.


country_year_summary <- tidy_athlete_events %>%
  group_by(NOC, Year) %>%
  summarise(
    male_participants = sum(Sex == "M", na.rm = TRUE),
    female_participants = sum(Sex == "F", na.rm = TRUE),
    total_participants = male_participants + female_participants,
    male_gold_medals = sum(Gold == 1 & Sex == "M", na.rm = TRUE),
    male_silver_medals = sum(Silver == 1 & Sex == "M", na.rm = TRUE),
    male_bronze_medals = sum(Bronze == 1 & Sex == "M", na.rm = TRUE),
    female_gold_medals = sum(Gold == 1 & Sex == "F", na.rm = TRUE),
    female_silver_medals = sum(Silver == 1 & Sex == "F", na.rm = TRUE),
    female_bronze_medals = sum(Bronze == 1 & Sex == "F", na.rm = TRUE),
    male_total_medals = male_gold_medals + male_silver_medals + male_bronze_medals,
    female_total_medals = female_gold_medals + female_silver_medals + female_bronze_medals,
    total_medals = male_total_medals + female_total_medals,
    .groups = "drop"
  )

deduplicated_combined_noc_data <- combined_noc_data %>%
  distinct(NOC, .keep_all = TRUE)
ready_country_year_summary <- country_year_summary %>%
  left_join(deduplicated_combined_noc_data, by = "NOC") %>%  # Join with deduplicated NOC data
  mutate(
    id = paste(Year, NOC, sep = "_")  # Create ID
  ) %>%
  rename(Region = NOC) %>%  # Rename NOC to Region
  select(id, Region, everything(), -country_participate)

write_csv(ready_country_year_summary, file.path(readydata_ddir, "ready_country_year_summary.csv"))
print(head(ready_country_year_summary))
## # A tibble: 6 × 15
##   id       Region  Year male_participants female_participants total_participants
##   <chr>    <chr>  <int>             <int>               <int>              <int>
## 1 1936_AFG AFG     1936                15                   0                 15
## 2 1948_AFG AFG     1948                25                   0                 25
## 3 1956_AFG AFG     1956                12                   0                 12
## 4 1960_AFG AFG     1960                12                   0                 12
## 5 1964_AFG AFG     1964                 8                   0                  8
## 6 1968_AFG AFG     1968                 5                   0                  5
## # ℹ 9 more variables: male_gold_medals <int>, male_silver_medals <int>,
## #   male_bronze_medals <int>, female_gold_medals <int>,
## #   female_silver_medals <int>, female_bronze_medals <int>,
## #   male_total_medals <int>, female_total_medals <int>, total_medals <int>
# SECOND TRANSFORMATIONS: Turn the tidy_olympic_participators2020 into a long format and count the total number of people so that the data is the same as the ready_country_year_summary format


tidy_olympic_participators2020 <- read_csv("output/tidydata/tidy_olympic_participators2020.csv")
ready_olympic_participators2020 <- tidy_olympic_participators2020 %>%
  pivot_wider(
    names_from = Gender,
    values_from = Participants
  ) %>%
  rename(
    Region = Country,        
    male_participants = Male,    
    female_participants = Female
  ) %>%
  mutate(
    male_participants = as.numeric(sapply(male_participants, function(x) ifelse(is.null(x), 0, x))),  
    female_participants = as.numeric(sapply(female_participants, function(x) ifelse(is.null(x), 0, x))) # Flatten and convert to numeric
  ) %>%
  group_by(id, Region) %>%  
  summarise(
    male_participants = sum(male_participants, na.rm = TRUE),   
    female_participants = sum(female_participants, na.rm = TRUE), 
    total_participants = male_participants + female_participants,
    .groups = "drop"
  ) %>%
  mutate(Year = 2020) # Add a fixed Year column as 2020

ready_olympic_participators2020 <- ready_olympic_participators2020 %>%
  select(id, Region, Year, male_participants, female_participants, total_participants)

write_csv(ready_olympic_participators2020, file.path(readydata_ddir, "ready_olympic_participators2020.csv"))
print(head(ready_olympic_participators2020))
## # A tibble: 6 × 6
##   id       Region  Year male_participants female_participants total_participants
##   <chr>    <chr>  <dbl>             <dbl>               <dbl>              <dbl>
## 1 2020_AFG AFG     2020                 8                   2                 10
## 2 2020_ALB ALB     2020                12                   6                 18
## 3 2020_ALG ALG     2020                50                  26                 76
## 4 2020_AND AND     2020                 2                   2                  4
## 5 2020_ANG ANG     2020                 8                  32                 40
## 6 2020_ANT ANT     2020                 6                   6                 12
# THIRD TRANSFORMATIONS: Rename columns add IDs to tidy_olympic_female_medals2020, tidy_olympic_medal_data,making  their format looks like ready_country_year_summary


tidy_olympic_female_medals2020 <- read_csv("output/tidydata/tidy_olympic_female_medals2020.csv")
ready_olympic_female_medals2020 <- tidy_olympic_female_medals2020 %>%
  rename(
    Region = Nat  
  ) %>%
  select(-Athlete) %>%  
  group_by(Region, Year) %>% 
  summarise(
    female_gold_medals = sum(Gold, na.rm = TRUE),
    female_silver_medals = sum(Silver, na.rm = TRUE),
    female_bronze_medals = sum(Bronze, na.rm = TRUE),
    female_total_medals = female_gold_medals + female_silver_medals + female_bronze_medals,  # Calculate total medals
    .groups = "drop" 
  ) %>%
  mutate(
    id = paste(Year, Region, sep = "_")
  ) %>%
  select(id, Region, Year, female_gold_medals, female_silver_medals, female_bronze_medals, female_total_medals)   # Reorder columns

tidy_olympic_medal_data <- read_csv("output/tidydata/tidy_olympic_medal_data.csv")
ready_olympic_medal_data <- tidy_olympic_medal_data %>%
  rename(
    Year = olympic_year,                  
    Region = NOC,                         
    rank = olympic_rank,                  
    whole_name = region_participate,      
    total_gold_medals = medal_gold,       
    total_silver_medals = medal_silver,   
    total_bronze_medals = medal_bronze,   
    total_medals = medal_total            
  ) %>%
  select(
    Year, Region, rank, whole_name, total_gold_medals, total_silver_medals, total_bronze_medals, total_medals,country_host
  )   # Select and reorder the required columns

write_csv(ready_olympic_female_medals2020, file.path(readydata_ddir, "ready_olympic_female_medals2020.csv"))
write_csv(ready_olympic_medal_data, file.path(readydata_ddir, "ready_olympic_medal_data.csv"))
print(head(ready_olympic_female_medals2020))
## # A tibble: 6 × 7
##   id       Region  Year female_gold_medals female_silver_medals
##   <chr>    <chr>  <dbl>              <dbl>                <dbl>
## 1 2020_AUS AUS     2020                 13                    1
## 2 2020_BRA BRA     2020                  1                    1
## 3 2020_CAN CAN     2020                  1                    1
## 4 2020_CHN CHN     2020                  9                    4
## 5 2020_GBR GBR     2020                  2                    2
## 6 2020_HUN HUN     2020                  1                    1
## # ℹ 2 more variables: female_bronze_medals <dbl>, female_total_medals <dbl>
print(head(ready_olympic_medal_data))
## # A tibble: 6 × 9
##    Year Region  rank whole_name    total_gold_medals total_silver_medals
##   <dbl> <chr>  <dbl> <chr>                     <dbl>               <dbl>
## 1  1896 USA        1 United States                11                   7
## 2  1896 GRE        2 Greece                       10                  18
## 3  1896 GER        3 Germany                       6                   5
## 4  1896 GER        3 Germany                       6                   5
## 5  1896 GER        3 Germany                       6                   5
## 6  1896 FRA        4 France                        5                   4
## # ℹ 3 more variables: total_bronze_medals <dbl>, total_medals <dbl>,
## #   country_host <dbl>
# FOURTH TRANSFORMATIONS: Calculate host count and rank


host_count_rank <- ready_olympic_medal_data %>%
  group_by(Region) %>%
  summarise(
    host_count = sum(country_host == 1, na.rm = TRUE),  # Count hosting years
    .groups = "drop"
  ) %>%
  arrange(desc(host_count)) %>%  # Rank by host count, descending
  mutate(rank = row_number())  # Add ranking column
host_count_rank <- host_count_rank %>%
  select(rank, Region, host_count, everything())
print(head(host_count_rank, 10))
## # A tibble: 10 × 3
##     rank Region host_count
##    <int> <chr>       <int>
##  1     1 FRA             4
##  2     2 GER             4
##  3     3 USA             3
##  4     4 GRE             2
##  5     5 JPN             2
##  6     6 AUS             1
##  7     7 AUT             1
##  8     8 BEL             1
##  9     9 BRA             1
## 10    10 CAN             1
# FIFTH TRANSFORMATIONS: Calculate historical Olympic medal tally by country
historical_medal_tally <- ready_olympic_medal_data %>%
  group_by(Region) %>%
  summarise(
    total_gold_medals = sum(total_gold_medals, na.rm = TRUE), # Sum of gold medals
    total_silver_medals = sum(total_silver_medals, na.rm = TRUE), # Sum of silver medals
    total_bronze_medals = sum(total_bronze_medals, na.rm = TRUE), # Sum of bronze medals
    total_medals = sum(total_medals, na.rm = TRUE), # Total medals (gold + silver + bronze)
    .groups = "drop"
  ) %>%
  arrange(desc(total_medals)) %>%  # Rank by total medals
  mutate(rank = row_number())      # Add ranking column
historical_medal_tally <- historical_medal_tally %>%
  select(rank, Region, total_medals, everything())
print(head(historical_medal_tally, 10))
## # A tibble: 10 × 6
##     rank Region total_medals total_gold_medals total_silver_medals
##    <int> <chr>         <dbl>             <dbl>               <dbl>
##  1     1 GER            2837               891                 925
##  2     2 USA            2768              1107                 879
##  3     3 RUS            1862               690                 573
##  4     4 GBR             991               305                 342
##  5     5 FRA             829               245                 280
##  6     6 CHN             727               303                 226
##  7     7 ITA             667               231                 206
##  8     8 AUS             610               183                 196
##  9     9 JPN             544               189                 162
## 10    10 HUN             532               187                 162
## # ℹ 1 more variable: total_bronze_medals <dbl>

5. [Create queries and visualizations]

5.1: Explore data with SQL

I used the following queries to lay a foundation for visualization: 1.I calculated and compared the participation rates of male and female athletes across different Olympic years and regions(includes “world”), revealing trends in gender equality in sports. 2.I deine a “medal efficiency” metric, demonstrating each region(includes “world”) and each gender’s effectiveness in converting participation into medals.

# As we've linked all the tables with ids, now we can use SQL to build a detailed database of Olympic participation and awards! It contains the region, year, gender, number of medals won separately.

# Let us start with Gender participation rate.
library(DBI)
library(RSQLite)

# Define directory for database
database_ddir <- "output/database"
if (!dir.exists(database_ddir)) dir.create(database_ddir)
db <- dbConnect(RSQLite::SQLite(), paste0(database_ddir, "olympic-db.sqlite"))

dbWriteTable(db, "country_year_summary", ready_country_year_summary, overwrite = TRUE)
dbWriteTable(db, "olympic_female_medals2020", ready_olympic_female_medals2020, overwrite = TRUE)
dbWriteTable(db, "olympic_medal_data", ready_olympic_medal_data, overwrite = TRUE)
dbWriteTable(db, "olympic_participators2020", ready_olympic_participators2020, overwrite = TRUE)

# SQL Query to Combine 2020 Data Explicitly with Historical Data
combine_query <- "
  SELECT
    cys.id AS id,
    cys.Region AS Region,
    cys.Year AS Year,
    COALESCE(cys.male_participants, part.male_participants) AS male_participants,
    COALESCE(cys.female_participants, part.female_participants) AS female_participants,
    COALESCE(cys.total_participants, part.total_participants) AS total_participants
  FROM country_year_summary cys
  FULL OUTER JOIN olympic_participators2020 part ON cys.id = part.id
  WHERE cys.Year IS NOT NULL OR part.Year = 2020
"
# Execute the query and save new table
combined_data <- dbGetQuery(db, combine_query)
dbWriteTable(db, "combined_data_with_2020", combined_data, overwrite = TRUE)

5.1.1: Calculate gender participation rate

# Gender participation rates by country and year
id_gender_participation_rate_query <- "
  SELECT
      id, Region, Year,
      COALESCE(male_participants, 0) AS male_participants,
      COALESCE(female_participants, 0) AS female_participants,
      COALESCE(total_participants, 0) AS total_participants,
      ROUND(COALESCE(male_participants, 0) * 1.0 / COALESCE(total_participants, 1), 3) AS male_participation_rate,
      ROUND(COALESCE(female_participants, 0) * 1.0 / COALESCE(total_participants, 1), 3) AS female_participation_rate
  FROM (
      SELECT
          id, Region, Year, male_participants, female_participants,
          total_participants
      FROM country_year_summary
      UNION ALL
      SELECT
          id, Region, Year, male_participants, female_participants,
          total_participants
      FROM olympic_participators2020
  ) combined
"
id_gender_participation_rate <- dbGetQuery(db, id_gender_participation_rate_query)
write_csv(id_gender_participation_rate, file.path(database_ddir, "id_gender_participation_rate.csv"))

# Global gender participation rates and year
historical_gender_participation_rate_query <- "
  SELECT
      Year,
      SUM(COALESCE(male_participants, 0)) AS total_male_participants,
      SUM(COALESCE(female_participants, 0)) AS total_female_participants,
      SUM(COALESCE(total_participants, 0)) AS total_participants,
      ROUND(SUM(COALESCE(male_participants, 0)) * 1.0 / SUM(COALESCE(total_participants, 0)), 3) AS global_male_participation_rate,
      ROUND(SUM(COALESCE(female_participants, 0)) * 1.0 / SUM(COALESCE(total_participants, 0)), 3) AS global_female_participation_rate
  FROM (
      SELECT
          Year, male_participants, female_participants, total_participants
      FROM country_year_summary
      UNION ALL
      SELECT
          Year, male_participants, female_participants, total_participants
      FROM olympic_participators2020
  ) combined
  GROUP BY Year
  ORDER BY Year
"
historical_gender_participation_rate <- dbGetQuery(db, historical_gender_participation_rate_query)
write_csv(historical_gender_participation_rate, file.path(database_ddir, "historical_gender_participation_rate.csv"))

# Close the database connection
dbDisconnect(db)

5.1.2: Calculate medal efficiency

# Then, We introduce "Medal Efficiency" to measure how effectively a country converts its participants into medals.
# we define: Gold = 4 point, Silver = 2 point, Bronze = 1 point, None medal = 0 point;
# Medal Efficiency = sum(points)/sum(Participants)

database_ddir <- "output/database"
if (!dir.exists(database_ddir)) dir.create(database_ddir)
db <- dbConnect(RSQLite::SQLite(), paste0(database_ddir, "olympic-db.sqlite"))

dbWriteTable(db, "country_year_summary", ready_country_year_summary, overwrite = TRUE)
dbWriteTable(db, "olympic_female_medals2020", ready_olympic_female_medals2020, overwrite = TRUE)
dbWriteTable(db, "olympic_medal_data", ready_olympic_medal_data, overwrite = TRUE)
dbWriteTable(db, "olympic_participators2020", ready_olympic_participators2020, overwrite = TRUE)

# Prepare data of 2020
dbExecute(db, "DELETE FROM country_year_summary WHERE Year = 2020") 
## [1] 0
update_2020_data <- "
INSERT OR REPLACE INTO country_year_summary (
    id,  Region, Year, 
    male_participants, female_participants, 
    male_gold_medals,  male_silver_medals, male_bronze_medals, 
    female_gold_medals, female_silver_medals, female_bronze_medals
)
SELECT 
    '2020_' || op.Region AS id,
    op.Region,
    2020 AS Year,
    op.male_participants, op.female_participants,
    (omd.total_gold_medals - ofm.female_gold_medals) AS male_gold_medals,
    (omd.total_silver_medals - ofm.female_silver_medals) AS male_silver_medals,
    (omd.total_bronze_medals - ofm.female_bronze_medals) AS male_bronze_medals,
    ofm.female_gold_medals,
    ofm.female_silver_medals,
    ofm.female_bronze_medals
FROM 
    olympic_participators2020 op
JOIN 
    olympic_female_medals2020 ofm ON op.Region = ofm.Region
JOIN
    olympic_medal_data omd ON op.Region = omd.Region AND omd.Year = 2020
"
dbExecute(db, update_2020_data)
## [1] 14
# Calculate historical medal efficiency by ID
dbExecute(db, "DROP TABLE IF EXISTS id_medal_efficiency")
## [1] 0
id_medal_efficiency_query <- "
CREATE TABLE id_medal_efficiency AS
SELECT
    id, Region, Year,
    ROUND(COALESCE((4 * male_gold_medals + 2 * male_silver_medals + male_bronze_medals) * 1.0 / NULLIF(male_participants, 0), 0), 3) AS male_medal_efficiency,
    ROUND(COALESCE((4 * female_gold_medals + 2 * female_silver_medals + female_bronze_medals) * 1.0 / NULLIF(female_participants, 0), 0), 3) AS female_medal_efficiency,
    ROUND(COALESCE((4 * (male_gold_medals + female_gold_medals) + 2 * (male_silver_medals + female_silver_medals) + (male_bronze_medals + female_bronze_medals)) * 1.0 / NULLIF((male_participants + female_participants), 0), 0), 3) AS total_medal_efficiency
FROM country_year_summary;
"
dbExecute(db, id_medal_efficiency_query)
## [1] 0
# Calculate global annual medal efficiency by gender
dbExecute(db, "DROP TABLE IF EXISTS global_medal_efficiency")
## [1] 0
global_medal_efficiency_query <- "
CREATE TABLE global_medal_efficiency AS
SELECT
    Year,
    ROUND(COALESCE(SUM(4 * male_gold_medals + 2 * male_silver_medals + male_bronze_medals) * 1.0 / NULLIF(SUM(male_participants), 0), 0), 3) AS global_male_medal_efficiency,
    ROUND(COALESCE(SUM(4 * female_gold_medals + 2 * female_silver_medals + female_bronze_medals) * 1.0 / NULLIF(SUM(female_participants), 0), 0), 3) AS global_female_medal_efficiency,
    ROUND(COALESCE(SUM(4 * (male_gold_medals + female_gold_medals) + 2 * (male_silver_medals + female_silver_medals) + (male_bronze_medals + female_bronze_medals)) * 1.0 / NULLIF(SUM(male_participants + female_participants), 0), 0), 3) AS global_total_medal_efficiency
FROM country_year_summary
GROUP BY Year;
"
dbExecute(db, global_medal_efficiency_query)
## [1] 0
id_medal_efficiency <- dbGetQuery(db, "SELECT * FROM id_medal_efficiency")
global_medal_efficiency <- dbGetQuery(db, "SELECT * FROM global_medal_efficiency")


write_csv(id_medal_efficiency, file.path(database_ddir, "id_medal_efficiency.csv"))
write_csv(global_medal_efficiency, file.path(database_ddir, "global_medal_efficiency.csv"))

dbDisconnect(db)

5.2: Create visualization

5.2.1: Global Gender Participation vs. GII

library(plotly)

vis_ddir <- "output/vis"
if (!dir.exists(vis_ddir)) dir.create(vis_ddir)

tidy_gii_world_data <- read_csv("output/tidydata/tidy_gii_world_data.csv") %>%
  mutate(Year = as.numeric(year))

historical_gender_participation_rate<- read_csv("output/database/historical_gender_participation_rate.csv")

historical_gender_participation_rate <- historical_gender_participation_rate %>%
  mutate(total_participation = global_male_participation_rate + global_female_participation_rate,
         male_prop = global_male_participation_rate / total_participation,
         female_prop = global_female_participation_rate / total_participation,
         male_y_start = 1 - male_prop,  # Starting point for male
         male_y_end = 1,               # Ending point for male at 100%
         female_y_start = 0,           # Starting point for female at 0%
         female_y_end = female_prop,   # Ending point for female
         xmin = Year - 0.9,           # Adjusted for better visualization
         xmax = Year + 0.9)

participation_plot <- ggplot(historical_gender_participation_rate, aes(x = Year)) +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = male_y_start, ymax = male_y_end, fill = "Male")) +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = female_y_start, ymax = female_y_end, fill = "Female")) +
  geom_line(aes(y = female_prop, group = 1, color = "Female Participation"), size = 0.8 ) +
  geom_point(aes(y = female_prop), color = "yellow", size = 1, shape = 21, fill = "yellow") +
  geom_line(data = tidy_gii_world_data, aes(x = year, y = value, group = 1, color = "GII Index"), size = 0.8) +
  scale_fill_manual(values = c("Male" = "#ADD8E6", "Female" = "#FFB6C1")) +
  scale_color_manual(values = c("Female Participation" = "yellow", "GII Index" = "red")) +
  labs(title = "Global Participation in Summer Olympics (1896-2020) with GII",
       x = "Year",
       y = "Proportion of Total Participation",
       fill = "Gender",
       color = "Index & Trends") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom",  # Move legend to the bottom
    legend.title = element_blank()  # Optionally remove legend titles
  )

# Convert to interactive plotly object
participation_plotly <- ggplotly(participation_plot, tooltip = c("x", "y", "text")) %>%
  layout(
    hovermode = "closest",
    xaxis = list(showspikes = TRUE, spikedash = "solid", spikecolor = "grey", spikethickness = 0.5),
    yaxis = list(showspikes = TRUE, spikedash = "solid", spikecolor = "grey", spikethickness = 0.5)
  )

# Save the interactive plot to HTML
htmlwidgets::saveWidget(participation_plotly, file.path(vis_ddir, "participation_plotly.html"))
print(participation_plotly)

The first visualization showed a trend where lower Global Gender Inequality Index (GII) values correlate with higher female participation rates in the Olympics, suggesting that gender equality fosters greater female involvement in sports.

5.2.2: Medal Efficiency by Gender

global_medal_efficiency <- read_csv("output/database/global_medal_efficiency.csv")

# Create a base line chart
efficiency_plot <- ggplot(global_medal_efficiency, aes(x = Year)) +
  geom_line(aes(y = global_male_medal_efficiency, color = "Male Efficiency"), size = 1) +
  geom_line(aes(y = global_female_medal_efficiency, color = "Female Efficiency"), size = 1) +
  geom_line(aes(y = global_total_medal_efficiency, color = "Global Efficiency"), size = 0.5, linetype = "dashed") +
  scale_color_manual(values = c("Male Efficiency" = "#ADD8E6",  
                                "Female Efficiency" = "#FFB6C1",  
                                "Global Efficiency" = "#636363")) +
  labs(
    title = "Medal Efficiency by Gender in Summer Olympics (1896-2020)",
    x = "Year",
    y = "Medal Efficiency",
    color = "Category"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.title = element_blank()
  )

# Convert to an interactive Plotly 
efficiency_plotly <- ggplotly(efficiency_plot, tooltip = c("x", "y", "color")) %>%
  layout(
    hovermode = "closest",
    xaxis = list(showspikes = TRUE),
    yaxis = list(showspikes = TRUE)
  )


htmlwidgets::saveWidget(efficiency_plotly,file.path(vis_ddir, "efficiency_plotly.html"))
efficiency_plotly

The second chart revealed that despite historically lower participation rates, women’s medal efficiency in the Olympics has generally outperformed that of men, indicating that female athletes achieve high success relative to their participation numbers.

5.2.3: UK Case Study on Gender Equality and Performance

id_medal_efficiency <- read.csv("output/database/id_medal_efficiency.csv")
# Now let's define a concept: Gender equality level
# I use the reciprocal of the gender inequality index to represent the gender equality level, but this is not a serious academic concept, just for convenience

# Filter data for the UK from 1990 to 2020
uk_medal_efficiency <- id_medal_efficiency %>%
  filter(Region == "GBR", Year >= 1990, Year <= 2020) %>%
  select(Year, female_medal_efficiency)

us_gii_data <- tidy_gii_data %>%
  filter(country == "United Kingdom") %>%
  mutate(year = as.numeric(year),  
         GII_reciprocal = 1 - value) %>%
  filter(year >= 1990, year <= 2020) %>%
  select(year, GII_reciprocal)

uk_merged_data <- left_join(uk_medal_efficiency, us_gii_data, by = c("Year" = "year"))

# Create a ggplot object
uk_plot <- ggplot(uk_merged_data, aes(x = GII_reciprocal, y = female_medal_efficiency)) +
  geom_point(color = "#636363") +
  geom_smooth(method = "lm", color = "#EF476F") +
  labs(
    title = "UK Case Study on Gender Equality and Performance (1990-2020)",
    x = " Gender Equality Level (1-GII)",
    y = "Female Medal Efficiency"
  ) +
  theme_minimal()

# Convert to an interactive Plotly graph
uk_plotly <- ggplotly(uk_plot, tooltip = c("x", "y")) %>%
  layout(
    hovermode = "closest",
    xaxis = list(showspikes = TRUE),
    yaxis = list(showspikes = TRUE)
  )


htmlwidgets::saveWidget(uk_plotly,file.path(vis_ddir, "uk_plotly.html"))
uk_plotly

The third visualization used the UK as a case study to demonstrate a positive correlation between gender equality and female athletic performance, supporting the idea that societal gender equality advances women’s success in sports.

6. [Storage Structure ]

Finally, you can see the storage structure of the entire project through the following code, where rowdata contains csv, JSON and PNG, and output has processed tables and plotly HTML.

current_dir <- getwd()
files_list <- list.files(current_dir, recursive = TRUE, full.names = FALSE)
files_df <- data.frame(FileName = files_list)
print(files_df)
##                                                                                      FileName
## 1                                                            MY472-AT24-final-instructions.md
## 2                                         MY472-AT24-final-report_files/figure-html/3.3-1.png
## 3                                                                 MY472-AT24-final-report.Rmd
## 4                                                 output/database/global_medal_efficiency.csv
## 5                                    output/database/historical_gender_participation_rate.csv
## 6                                            output/database/id_gender_participation_rate.csv
## 7                                                     output/database/id_medal_efficiency.csv
## 8                                                            output/databaseolympic-db.sqlite
## 9                                             output/readydata/ready_country_year_summary.csv
## 10                                       output/readydata/ready_olympic_female_medals2020.csv
## 11                                              output/readydata/ready_olympic_medal_data.csv
## 12                                       output/readydata/ready_olympic_participators2020.csv
## 13                                                    output/tidydata/tidy_athlete_events.csv
## 14                                                          output/tidydata/tidy_gii_data.csv
## 15                                                    output/tidydata/tidy_gii_world_data.csv
## 16                                         output/tidydata/tidy_olympic_female_medals2020.csv
## 17                                                output/tidydata/tidy_olympic_medal_data.csv
## 18                                         output/tidydata/tidy_olympic_participators2020.csv
## 19                   output/vis/efficiency_plotly_files/crosstalk-1.2.1/css/crosstalk.min.css
## 20                         output/vis/efficiency_plotly_files/crosstalk-1.2.1/js/crosstalk.js
## 21                     output/vis/efficiency_plotly_files/crosstalk-1.2.1/js/crosstalk.js.map
## 22                     output/vis/efficiency_plotly_files/crosstalk-1.2.1/js/crosstalk.min.js
## 23                 output/vis/efficiency_plotly_files/crosstalk-1.2.1/js/crosstalk.min.js.map
## 24                     output/vis/efficiency_plotly_files/crosstalk-1.2.1/scss/crosstalk.scss
## 25                         output/vis/efficiency_plotly_files/htmltools-fill-0.5.8.1/fill.css
## 26                        output/vis/efficiency_plotly_files/htmlwidgets-1.6.4/htmlwidgets.js
## 27                         output/vis/efficiency_plotly_files/jquery-3.5.1/jquery-AUTHORS.txt
## 28                                  output/vis/efficiency_plotly_files/jquery-3.5.1/jquery.js
## 29                              output/vis/efficiency_plotly_files/jquery-3.5.1/jquery.min.js
## 30                             output/vis/efficiency_plotly_files/jquery-3.5.1/jquery.min.map
## 31                         output/vis/efficiency_plotly_files/plotly-binding-4.10.4/plotly.js
## 32    output/vis/efficiency_plotly_files/plotly-htmlwidgets-css-2.11.1/plotly-htmlwidgets.css
## 33                 output/vis/efficiency_plotly_files/plotly-main-2.11.1/plotly-latest.min.js
## 34                        output/vis/efficiency_plotly_files/typedarray-0.1/typedarray.min.js
## 35                                                          output/vis/efficiency_plotly.html
## 36                output/vis/participation_plotly_files/crosstalk-1.2.1/css/crosstalk.min.css
## 37                      output/vis/participation_plotly_files/crosstalk-1.2.1/js/crosstalk.js
## 38                  output/vis/participation_plotly_files/crosstalk-1.2.1/js/crosstalk.js.map
## 39                  output/vis/participation_plotly_files/crosstalk-1.2.1/js/crosstalk.min.js
## 40              output/vis/participation_plotly_files/crosstalk-1.2.1/js/crosstalk.min.js.map
## 41                  output/vis/participation_plotly_files/crosstalk-1.2.1/scss/crosstalk.scss
## 42                      output/vis/participation_plotly_files/htmltools-fill-0.5.8.1/fill.css
## 43                     output/vis/participation_plotly_files/htmlwidgets-1.6.4/htmlwidgets.js
## 44                      output/vis/participation_plotly_files/jquery-3.5.1/jquery-AUTHORS.txt
## 45                               output/vis/participation_plotly_files/jquery-3.5.1/jquery.js
## 46                           output/vis/participation_plotly_files/jquery-3.5.1/jquery.min.js
## 47                          output/vis/participation_plotly_files/jquery-3.5.1/jquery.min.map
## 48                      output/vis/participation_plotly_files/plotly-binding-4.10.4/plotly.js
## 49 output/vis/participation_plotly_files/plotly-htmlwidgets-css-2.11.1/plotly-htmlwidgets.css
## 50              output/vis/participation_plotly_files/plotly-main-2.11.1/plotly-latest.min.js
## 51                     output/vis/participation_plotly_files/typedarray-0.1/typedarray.min.js
## 52                                                       output/vis/participation_plotly.html
## 53                           output/vis/uk_plotly_files/crosstalk-1.2.1/css/crosstalk.min.css
## 54                                 output/vis/uk_plotly_files/crosstalk-1.2.1/js/crosstalk.js
## 55                             output/vis/uk_plotly_files/crosstalk-1.2.1/js/crosstalk.js.map
## 56                             output/vis/uk_plotly_files/crosstalk-1.2.1/js/crosstalk.min.js
## 57                         output/vis/uk_plotly_files/crosstalk-1.2.1/js/crosstalk.min.js.map
## 58                             output/vis/uk_plotly_files/crosstalk-1.2.1/scss/crosstalk.scss
## 59                                 output/vis/uk_plotly_files/htmltools-fill-0.5.8.1/fill.css
## 60                                output/vis/uk_plotly_files/htmlwidgets-1.6.4/htmlwidgets.js
## 61                                 output/vis/uk_plotly_files/jquery-3.5.1/jquery-AUTHORS.txt
## 62                                          output/vis/uk_plotly_files/jquery-3.5.1/jquery.js
## 63                                      output/vis/uk_plotly_files/jquery-3.5.1/jquery.min.js
## 64                                     output/vis/uk_plotly_files/jquery-3.5.1/jquery.min.map
## 65                                 output/vis/uk_plotly_files/plotly-binding-4.10.4/plotly.js
## 66            output/vis/uk_plotly_files/plotly-htmlwidgets-css-2.11.1/plotly-htmlwidgets.css
## 67                         output/vis/uk_plotly_files/plotly-main-2.11.1/plotly-latest.min.js
## 68                                output/vis/uk_plotly_files/typedarray-0.1/typedarray.min.js
## 69                                                                  output/vis/uk_plotly.html
## 70                                                                                  README.md
## 71                              rowdata/120-years-of-olympic-history-athletes-and-results.zip
## 72               rowdata/120-years-of-olympic-history-athletes-and-results/athlete_events.csv
## 73                  rowdata/120-years-of-olympic-history-athletes-and-results/noc_regions.csv
## 74                                                                    rowdata/GII_diagram.png
## 75                                                                      rowdata/hdr-data.json
## 76                                                                   rowdata/hdr-data(1).json
## 77                                                     rowdata/Olympic_Female_Medals_2020.csv
## 78                                                             rowdata/olympic_medal_data.csv
## 79                                                      rowdata/olympic_participators2020.csv
library(tidyverse)

rmd_file <- "MY472-AT24-final-report.Rmd" # path to your Rmd file

read_file(rmd_file) %>% # read the file as a text file
  str_squish() %>% # remove all extra white space
  str_replace("^.+?output.+?[-]{3}", "") %>% # remove header
  str_replace_all("``` *[{].+?```", " ") %>% # remove code chunks
  str_replace_all("<![-].+?-->", " ") %>% # remove rmd comments
  str_replace_all("[!]?\\[.+?\\][(].+?[)]", " ") %>% # remove links
  str_replace_all("(^|\\s+)[^A-Za-z0-9]+", " ") %>% # remove symbols (1)
  str_replace_all("[^A-Za-z0-9]+($|\\s+)", " ") %>% # remove symbols (2)
  str_count("\\S+") %>% 
  paste("The document is", ., "words.") %>%
  print()
## [1] "The document is 681 words."
>>>>>>> 103afe631bd05b0cf645a264ad34b0e027198e03